Maria’s Cleaning

Goals of this notebook

The steps we’ll take to prepare our data:

  • Download the data
  • Import it into our notebook
  • Clean up data types and columns
  • Export the data for next notebook

Setup

Loading the libraries

library(tidyverse)
library(janitor)

Downloading the Data

In this section of the cleaning notebook, we’re loading the VERSA program data. This data accounts for arrests made under Operation Lone Star starting in January 2023.

Here is the summary of the differences between SPURS system and VERSA from the Depart of Public Safety Communications:

“Please note that arrests are now documented on two spreadsheets due to the implementation of a new record management system in 2023 by the Texas Highway Patrol (THP). Prior to January 2023, THP utilized the Enterprise Management (EM) System for traffic enforcement activity (written warnings, citations, and CMV inspections) and SPURS for documenting entrusted property and criminal case reports. Beginning in January of 2023, THP underwent a statewide rollout of the Versaterm Software Suite, completed in August 2023, which serves as THP’s only RMS system for regular traffic enforcement activity, entrusted property, as well as general and arrest offense report entry.”

More information about Operation Lone Star is found in the index page of this website.

The data was obtained by a Public Records Request from KUT.

Importing the Data

Here we’re importing the VERSA data spreadsheet that I already have downloaded to my “data-raw” folder. Then, we’re adding our data to a code chunk. For simplicity it will be called “versa_analysis,” which we will then peak at in order to look at what our data looks like in order to clean.

versa_analysis<- read_csv("data-raw/versa_raw_data.csv") |> clean_names()
Rows: 17256 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Arrest Date, County Name, Primary Key, Jurisdiction, Statute, Clas...
dbl  (3): Arrest Officer1, Charge Count, Apparent Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
versa_analysis
versa_analysis |> glimpse()
Rows: 17,256
Columns: 18
$ arrest_date         <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name         <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key         <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction        <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1     <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute             <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class               <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge              <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one           <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two           <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname  <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity           <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race                <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code         <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age        <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …

Notes for VERSA cleaning:

  • “Given three” refers to two middle names, only very few individuals in the list have this listed
  • Years are given as “apparent age” changed this to the actual age number to line up to SPURS “Person Age”

Fixing the dates

We want to make sure that our dates are in month, date, year format for consistency in our analysis.

versa_analysis_date <- versa_analysis |>
 mutate(
   charge_date = mdy(arrest_date)
  )

versa_analysis_date |> glimpse()
Rows: 17,256
Columns: 19
$ arrest_date         <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name         <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key         <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction        <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1     <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute             <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class               <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge              <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one           <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two           <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname  <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity           <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race                <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code         <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age        <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date         <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
versa_analysis_date

Summary Stats:

versa_analysis_date |> summary()
 arrest_date        county_name        primary_key        jurisdiction      
 Length:17256       Length:17256       Length:17256       Length:17256      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 arrest_officer1   statute             class           severity_code_group
 Min.   : 6154   Length:17256       Length:17256       Length:17256       
 1st Qu.:14978   Class :character   Class :character   Class :character   
 Median :15752   Mode  :character   Mode  :character   Mode  :character   
 Mean   :15295                                                            
 3rd Qu.:16099                                                            
 Max.   :16551                                                            
 NA's   :18                                                               
    charge           charge_count     given_one         given_three       
 Length:17256       Min.   : 0.000   Length:17256       Length:17256      
 Class :character   1st Qu.: 1.000   Class :character   Class :character  
 Mode  :character   Median : 1.000   Mode  :character   Mode  :character  
                    Mean   : 1.217                                        
                    3rd Qu.: 1.000                                        
                    Max.   :60.000                                        
                                                                          
  given_two         individual_surname  ethnicity             race          
 Length:17256       Length:17256       Length:17256       Length:17256      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 gender_code         apparent_age   charge_date        
 Length:17256       Min.   :1944   Min.   :2023-01-01  
 Class :character   1st Qu.:1986   1st Qu.:2023-06-30  
 Mode  :character   Median :1995   Median :2023-10-19  
                    Mean   :1993   Mean   :2023-10-24  
                    3rd Qu.:2001   3rd Qu.:2024-03-02  
                    Max.   :2023   Max.   :2024-07-11  
                    NA's   :6073                       

Fixing the age

We want our VERSA data to match up to SPURS when it comes the the “Person Age” data column. In order to do that we subtract this year’s date minus the “apparent_age” column of the VERSA data. This gives us the age in the same numerical format.

versa_analysis_age <- versa_analysis_date |>
  mutate(
    person_age = 2024- apparent_age
    )

versa_analysis_age |> glimpse()
Rows: 17,256
Columns: 20
$ arrest_date         <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name         <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key         <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction        <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1     <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute             <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class               <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge              <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one           <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two           <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname  <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity           <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race                <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code         <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age        <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date         <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
$ person_age          <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25…

Fixing Names

We want to change the columns for the names of the offenders, so it matches up to the SPURS data set. Theb we want to change the officer column to “officer” and

versa_analysis_names <- versa_analysis_age |>
  mutate(
    person_first_name = given_one, 
    person_second_name = given_two, 
    person_third_name = given_three, 
    person_last_name = individual_surname,
    officer_id = arrest_officer1, 
    arrest_county = county_name, 
    code = class,
    severity_code = severity_code_group, 
    person_race_abbr = race, 
    person_gender_abbr = gender_code, 
    )

versa_analysis_names |> glimpse()
Rows: 17,256
Columns: 30
$ arrest_date         <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name         <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key         <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction        <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1     <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute             <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class               <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge              <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one           <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two           <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname  <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity           <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race                <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code         <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age        <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date         <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
$ person_age          <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25…
$ person_first_name   <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ person_second_name  <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ person_third_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ person_last_name    <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ officer_id          <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ arrest_county       <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ code                <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code       <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ person_race_abbr    <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ person_gender_abbr  <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …

Create new column to match SPURS race column

Because the SPURS dataset uses the “race” column to identify both ethnicity and race, we created a new column that mirrors the SPURS way of documenting people’s profiles.

versa_analysis_new <- versa_analysis_names |>
  mutate(
    person_race_abbr = if_else(ethnicity== "H", ethnicity, person_race_abbr), 
    arrest_state = case_match(jurisdiction, "TX" ~ "TEXAS")
  )

versa_analysis_new |> glimpse()
Rows: 17,256
Columns: 31
$ arrest_date         <chr> "1/3/2023", "1/3/2023", "1/21/2023", "1/22/2023", …
$ county_name         <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ primary_key         <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX202…
$ jurisdiction        <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", "T…
$ arrest_officer1     <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ statute             <chr> "481.121(b)(1)", "481.116(c)", "15", "30.05(d)(1)"…
$ class               <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code_group <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ charge              <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WA…
$ charge_count        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ given_one           <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ given_three         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ given_two           <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ individual_surname  <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ ethnicity           <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ race                <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", …
$ gender_code         <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ apparent_age        <dbl> 1997, 1997, 1974, 1987, NA, 1998, 2003, NA, 1995, …
$ charge_date         <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 2…
$ person_age          <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25…
$ person_first_name   <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", …
$ person_second_name  <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "…
$ person_third_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ person_last_name    <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGU…
$ officer_id          <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 1…
$ arrest_county       <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", …
$ code                <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "P…
$ severity_code       <chr> "Versa Misdemeanor Charges", "VT Felony Charges", …
$ person_race_abbr    <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", …
$ person_gender_abbr  <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", …
$ arrest_state        <chr> "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXA…

Filter out unnnessary columns

After cleaning the names of the columns, I want to remove the columns with the old column names.I will use the select function to select the columns I don’t want.

versa_analysis_filter <- versa_analysis_new |>
 select(-c(apparent_age, given_one, given_two, given_three, individual_surname, arrest_officer1, jurisdiction, county_name, arrest_date, statute, class, severity_code_group, race, gender_code))

versa_analysis_filter |> glimpse()
Rows: 17,256
Columns: 17
$ primary_key        <chr> "TX20234", "TX20234", "TX20235", "TX20237", "TX2023…
$ charge             <chr> "Poss Marij < 2OZ", "Poss CS PG 2 >= 1G < 4G", "WAR…
$ charge_count       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ ethnicity          <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "…
$ charge_date        <date> 2023-01-03, 2023-01-03, 2023-01-21, 2023-01-22, 20…
$ person_age         <dbl> 27, 27, 50, 37, NA, 26, 21, NA, 29, NA, 30, 21, 25,…
$ person_first_name  <chr> "CRISPIN", "CRISPIN", "MOISES", "FERMIN", "JOSE", "…
$ person_second_name <chr> "III", "III", NA, NA, "CARLOS", NA, "GUADALUPE", "E…
$ person_third_name  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ person_last_name   <chr> "MERCADO", "MERCADO", "DE LA CRUZ LOPEZ", "RODRIGUE…
$ officer_id         <dbl> 15882, 15882, 15969, 15001, 15001, 15001, 15001, 15…
$ arrest_county      <chr> "Dimmit", "Dimmit", "Dimmit", "Kinney", "Kinney", "…
$ code               <chr> "HS", "HS", "CP", "PC", "PC", "PC", "PC", "PC", "PC…
$ severity_code      <chr> "Versa Misdemeanor Charges", "VT Felony Charges", "…
$ person_race_abbr   <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H", "H", "…
$ person_gender_abbr <chr> "M", "M", "M", "M", "M", "F", "M", "M", "M", "M", "…
$ arrest_state       <chr> "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXAS", "TEXAS…

Removing extra labels in “severity_code” and creating clean column

We noted that Versa and SPURS record charges differently under “severity_code.” We aren’t too sure if the Versa and VT label is equivalent to SPURS labels so we created a new column for cleaning purposes and we won’t remove the original column. This new column will be “severity_code_clean.”

versa_code_clean <- versa_analysis_filter |>
  mutate(
    severity_code_clean = case_match(severity_code, "Versa Misdemeanor Charges" ~ "Misdemeanor Charges", "VT Felony Charges" ~ "Felony Charges", "VT Federal/Other Charges" ~ "Federal/Other Charges"), 
    .after = severity_code
  )

versa_code_clean 

Importing cleaned data

 versa_code_clean |>
  write_rds("data-processed/01-clean-versa.rds")